4 Exploratory Analysis

4.1 Reading in the data

The observed yields can be found in the file yields.csv.

To read yields.csv into R, you will first need to download it from Moodle and save it somewhere accessible. Then, set your working directory in R to point to wherever you have saved this file. For a reminder on setting your working directory, see Lab 3 Section 2.1.
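
Before reading the file, it can help to confirm that R is looking in the right folder. The lines below are a minimal sketch using base R functions only; they assume the file was saved under the name yields.csv.

```r
# Show the folder R is currently using as its working directory
getwd()

# TRUE if yields.csv is in the working directory, FALSE otherwise
file.exists("yields.csv")

# List any CSV files that R can see in the working directory
list.files(pattern = "\\.csv$")
```

If file.exists() returns FALSE, the working directory is probably not the folder where the file was saved.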

To load the data frame into the Environment tab and call it yields, we can use the code below.

yields <- read.csv(file = "yields.csv")

It is always good practice to open the original file to check, for example, for any missing values and the column headings used. Also make sure that the data frame is stored in R as you would expect. Once you have read yields into your R session, you should explore its contents.

head(yields)
  fertiliser   crop  yield
1       Used potato 41.171
2       Used potato 34.423
3       Used potato 38.761
4       Used potato 36.662
5       Used potato 42.636
6       Used potato 37.052

We can see that there are 3 columns in yields:

  • fertiliser: this is a categorical variable stating whether fertiliser was used or not in the field.

  • crop: this is a categorical variable stating whether the crop grown in the field was potatoes or wheat.

  • yield: this is the yield (in tonnes) of the crop harvested from each field.
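
Beyond head(), a few base R functions give a quick check that the data frame looks as expected. The sketch below uses a small made-up data frame, toy_yields (hypothetical values, not the real data), so that it runs on its own; the same calls could be applied to yields.

```r
# Hypothetical stand-in for yields (made-up values, not the real data)
toy_yields <- data.frame(
  fertiliser = c("Used", "Used", "Not used", "Not used"),
  crop       = c("potato", "wheat", "potato", "wheat"),
  yield      = c(40.1, 24.3, NA, 18.2)
)

# Structure: number of rows, column names and column types
str(toy_yields)

# Number of rows and columns
dim(toy_yields)

# Number of missing values in each column
missing_counts <- colSums(is.na(toy_yields))
missing_counts
```

Here the yield column of the toy data contains one missing value, which colSums(is.na()) reports.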

Because fertiliser and crop are columns of categorical data, we should convert them to factors. In the code below, note that only the argument levels = is used with fertiliser and crop, because the categories are already labelled in the dataset.

yields$fertiliser <- factor(x = yields$fertiliser, 
                            levels = c("Used", "Not used"))

yields$crop <- factor(x = yields$crop, 
                      levels = c("potato", "wheat"))
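
One thing to watch when supplying levels = is that any entry that does not match one of the listed levels exactly (including capitalisation) becomes NA. The sketch below illustrates this with a short made-up vector, toy_fertiliser, which is not part of the dataset.

```r
# Hypothetical vector; the last entry has a lower-case "n" on purpose
toy_fertiliser <- c("Used", "Not used", "Used", "not used")

toy_factor <- factor(x = toy_fertiliser,
                     levels = c("Used", "Not used"))

# The levels are stored in the order given
levels(toy_factor)

# Count each level, showing any NA values created by mismatches
table(toy_factor, useNA = "ifany")

# Number of entries that did not match a level
sum(is.na(toy_factor))
```

Running table() with useNA = "ifany" on yields$fertiliser and yields$crop after converting them is a quick way to confirm that no values were lost.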

4.2 Summary statistics

Since we are interested in finding any differences between the mean yields of potatoes from fertilised and unfertilised fields, it makes sense to split the data into distinct subsets. potato_fertilised and potato_unfertilised, created using the code below, store the potato yields where fertiliser was used and fertiliser was not used respectively.

potato_fertilised <- subset(x = yields,
                            subset = (fertiliser == "Used" & crop == "potato"))

potato_unfertilised <- subset(x = yields,
                              subset = (fertiliser == "Not used" & crop == "potato"))

Task

Similarly, we are interested in differences between the mean yields of wheat from fertilised and unfertilised fields.

Create two subsets, wheat_fertilised which stores the wheat yields from fields where fertiliser was used, and wheat_unfertilised which stores the wheat yields from fields where fertiliser was not used.

wheat_fertilised <- subset(x = yields,
                           subset = (fertiliser == "Used" & crop == "wheat"))

wheat_unfertilised <- subset(x = yields,
                             subset = (fertiliser == "Not used" & crop == "wheat"))

Now that we have sensible subsets from the data, we can summarise the yield of crops from each type of field in some way. It would be appropriate to find the sample mean yield, for example, in each case and present these in a table within the statistical report.

The mean() function can be used to find the mean of a vector of numerical values. The only argument we need to give the mean() function here is:

  • x =: this is the vector of values to calculate the mean of. This can be numeric or logical.

Task

Use the function mean() to find the mean yield of potatoes from each type of field, and the mean yield of wheat from each type of field.

Within the Exploratory Analysis section of your report, include the values you calculate in a table similar to the one below. Make sure to caption your table appropriately, and write a short paragraph which introduces where these values come from.

         Fertilised   Unfertilised
Potato
Wheat

In order to find the mean yield in each case we’re interested in, the following code can be used.

mean(x = potato_fertilised$yield)
mean(x = potato_unfertilised$yield)
mean(x = wheat_fertilised$yield)
mean(x = wheat_unfertilised$yield)
[1] 38.75082
[1] 38.59715
[1] 23.80124
[1] 17.98154

These values can then be rounded to 2 decimal places and entered into a table in the report, such as:

Table 4.1: Sample mean yields (in tonnes) of potatoes and wheat from fields where fertiliser was and was not used.
         Fertilised   Unfertilised
Potato   38.75        38.60
Wheat    23.80        17.98
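
As an alternative to creating four subsets and calling mean() four times, the group means can be found in one step with tapply(). This is only a sketch: it uses simulated data in a hypothetical data frame, toy_yields, so the numbers it produces are not the values in Table 4.1.

```r
# Simulated stand-in for yields (hypothetical values, not the real data)
set.seed(1)
toy_yields <- data.frame(
  fertiliser = rep(c("Used", "Not used"), each = 10, times = 2),
  crop       = rep(c("potato", "wheat"), each = 20),
  yield      = c(rnorm(20, mean = 38, sd = 3),
                 rnorm(20, mean = 20, sd = 3))
)

# Mean yield for every combination of crop (rows) and fertiliser (columns)
toy_means <- tapply(X = toy_yields$yield,
                    INDEX = list(toy_yields$crop, toy_yields$fertiliser),
                    FUN = mean)

# Round to 2 decimal places, ready for a table
round(toy_means, digits = 2)
```

The result is a 2 by 2 table with one row per crop and one column per fertiliser group, which matches the layout of the report table.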

For any output we include in the Exploratory Analysis section of the statistical report, we want to explain the initial impressions that this output provides and what we might then expect a formal answer to the aims to be. We can use Table 4.1 to informally compare the yield of each crop in the presence and absence of fertiliser.

Task

How do the sample mean yields of potatoes from fertilised and unfertilised fields compare to each other?

Task

How do the sample means of wheat yields from fertilised and unfertilised fields compare to each other?
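
One way to make these comparisons concrete is to compute the difference between the fertilised and unfertilised sample means for each crop. The sketch below re-enters the four means reported above by hand; the object names means and differences are just illustrative.

```r
# Sample means as reported in the output above, entered by hand
means <- matrix(data = c(38.75082, 38.59715,
                         23.80124, 17.98154),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("Potato", "Wheat"),
                                c("Fertilised", "Unfertilised")))

# Fertilised minus unfertilised, for each crop
differences <- means[, "Fertilised"] - means[, "Unfertilised"]
round(differences, digits = 2)
# Potato  Wheat 
#   0.15   5.82 
```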

4.3 Exploratory plots

In order to informally answer the aims of our analysis, it is also a good idea to produce some appropriate plots that will graphically summarise the information contained in the data. For a reminder of how to create various plots in R, see Lab 4.

4.3.1 Aim 1

Is there a difference between the mean yields of potatoes from fields that were fertilised and fields that were unfertilised?

Task

What might some sensible plots for summarising the yield of potatoes from the two types of fields be?

To investigate our first aim, we will use histograms here, but there are other sensible options for presenting quantitative data. A histogram allows us to see roughly how the potato yields in the sample are distributed and to consider things such as:

  • whether the population means of the two groups might be equal.

  • if the population standard deviations look approximately equal between the groups.

The code below can be used to create two histograms, one above the other, of potato yields from fertilised and unfertilised fields. So that they are comparable, the bin widths and axis ranges are forced to be the same. This is done by specifying the same sequence in the breaks = argument and the same range of the y-axis in the ylim = argument for both histograms.

par(mfrow = c(2, 1))

hist(x = potato_fertilised$yield,
     breaks = seq(from = 25, to = 50, by = 2.5),
     col = "darkseagreen",
     ylim = c(0, 30),
     main = "Potato Yield from Fertilised Fields",
     xlab = "Yield (tonnes)")

hist(x = potato_unfertilised$yield,
     breaks = seq(from = 25, to = 50, by = 2.5),
     col = "darkseagreen1",
     ylim = c(0, 30),
     main = "Potato Yield from Unfertilised Fields",
     xlab = "Yield (tonnes)")

par(mfrow = c(1, 1))
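
The values passed to breaks = and ylim = above were chosen by hand. As a sketch of one alternative, they could be worked out from the data so that both histograms are guaranteed to share the same bins and y-axis. The vectors toy_a and toy_b are simulated stand-ins (not the real yields), and bin_width, shared_breaks and shared_ylim are hypothetical names.

```r
# Simulated stand-ins for two groups of yields (not the real data)
set.seed(2)
toy_a <- rnorm(50, mean = 38, sd = 3)
toy_b <- rnorm(50, mean = 37, sd = 3)

# Breaks that cover the range of both groups in steps of bin_width
bin_width  <- 2.5
all_values <- c(toy_a, toy_b)
shared_breaks <- seq(from = floor(min(all_values) / bin_width) * bin_width,
                     to   = ceiling(max(all_values) / bin_width) * bin_width,
                     by   = bin_width)

# Count the values in each bin without drawing anything
counts_a <- hist(x = toy_a, breaks = shared_breaks, plot = FALSE)$counts
counts_b <- hist(x = toy_b, breaks = shared_breaks, plot = FALSE)$counts

# A y-axis tall enough for the fullest bin in either group
shared_ylim <- c(0, max(counts_a, counts_b))

par(mfrow = c(2, 1))
hist(x = toy_a, breaks = shared_breaks, ylim = shared_ylim,
     main = "Toy Group A", xlab = "Value")
hist(x = toy_b, breaks = shared_breaks, ylim = shared_ylim,
     main = "Toy Group B", xlab = "Value")
par(mfrow = c(1, 1))
```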

Now, we can consider some questions that might be useful to informally answer the aim, or inform the statistical method we could use to formally answer it.

Task

How do the sample mean yields of potatoes from fertilised and unfertilised fields compare to each other?

Task

How do the sample standard deviations of potato yields from fertilised and unfertilised fields compare to each other?
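
The histograms give a visual impression of spread; the sample standard deviations themselves can be calculated with the sd() function, which is used in the same way as mean(). The sketch below uses simulated vectors (toy_fertilised and toy_unfertilised are hypothetical, not the real yields); with the real data the same calls could be applied to potato_fertilised$yield and potato_unfertilised$yield.

```r
# Simulated stand-ins for two groups of yields (not the real data)
set.seed(3)
toy_fertilised   <- rnorm(30, mean = 38, sd = 3)
toy_unfertilised <- rnorm(30, mean = 38, sd = 3)

# Sample standard deviation of each group
sd_fertilised   <- sd(x = toy_fertilised)
sd_unfertilised <- sd(x = toy_unfertilised)
sd_fertilised
sd_unfertilised

# A ratio close to 1 suggests the two groups have a similar spread
sd_fertilised / sd_unfertilised
```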

Task

Add the histograms you have created in R to the Exploratory Analysis section of your statistical report. Make sure to caption your figure appropriately.

Write a paragraph that describes what the histograms show, the initial comparisons between the two groups of potato yield and how this relates to the aims of the analysis.

Once you have created a plot in the Plots tab of RStudio, saving it to your clipboard is easy!

Copying plots from the Plots tab in RStudio. You can resize the plot in the pop-up window so it is a useful size. Then, once you click Copy Plot, it is ready to paste wherever you need it.
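
Copying from the Plots tab is not the only option: a plot can also be written straight to an image file from code using png() and dev.off(). The following is only a sketch; the values in toy_values, the file name and the image size are all made up for illustration, and the file is written to R's temporary folder rather than to your working directory.

```r
# Made-up values purely for illustration
toy_values <- c(36.2, 38.9, 41.0, 37.5, 39.3, 40.4, 35.8, 38.1)

# Hypothetical file location inside R's temporary folder
plot_file <- file.path(tempdir(), "toy_histogram.png")

# Open a PNG file, draw the plot into it, then close the file
png(filename = plot_file, width = 800, height = 600)
hist(x = toy_values,
     main = "Toy Histogram",
     xlab = "Value")
dev.off()

# TRUE if the image file was created
file.exists(plot_file)
```

The plot does not appear in the Plots tab while the PNG file is open; everything drawn between png() and dev.off() goes into the file instead.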

When it comes to formally looking for a statistical difference between the mean potato yields from the two types of field (we’ll look at this in depth in Lab 9), it may be useful to know whether the yields from both types of field follow a normal distribution. We could use the histograms produced above to comment on this assumption of normality, but another useful plot to show is a QQ plot.

The code below produces two side-by-side QQ plots: one for the potato yields from fertilised fields and one for the potato yields from unfertilised fields.

par(mfrow = c(1, 2))

qqnorm(y = potato_fertilised$yield,
       main = "Potato Yield from Fertilised Fields")
qqline(y = potato_fertilised$yield)

qqnorm(y = potato_unfertilised$yield,
       main = "Potato Yield from Unfertilised Fields")
qqline(y = potato_unfertilised$yield)

par(mfrow = c(1, 1))
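
To get a feel for how to read a QQ plot, it can help to compare one drawn from data that really are normal with one drawn from data that are clearly not. The sketch below uses simulated samples only (toy_normal and toy_skewed are hypothetical and unrelated to the yields data).

```r
# Simulated samples: one normal, one strongly right-skewed
set.seed(4)
toy_normal <- rnorm(200, mean = 38, sd = 3)
toy_skewed <- rexp(200, rate = 1 / 38)

par(mfrow = c(1, 2))

# Points should fall close to the line for normal data
qqnorm(y = toy_normal,
       main = "Simulated Normal Data")
qqline(y = toy_normal)

# Points should curve away from the line for skewed data
qqnorm(y = toy_skewed,
       main = "Simulated Right-Skewed Data")
qqline(y = toy_skewed)

par(mfrow = c(1, 1))
```

In the left-hand plot the points should lie close to the line throughout, while in the right-hand plot they should bend away from it, particularly at the upper end.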

Task

How would you describe the normality of potato yields from fertilised and unfertilised fields from the sample?

Task

Add the QQ plots you have created in R to the Exploratory Analysis section of your statistical report. Make sure to caption the figure appropriately.

Write a paragraph that describes what the QQ plots show, describing how closely the yields appear to follow a normal distribution.

4.3.2 Aim 2

Is there a difference between the mean yields of wheat from fields that were fertilised and fields that were unfertilised?

Our second aim is similar to the first, but focuses on wheat yields rather than potato yields. Therefore, the same type of plots will be useful to us here, as long as this time they are produced using the subsets wheat_fertilised and wheat_unfertilised created earlier.

Histograms of wheat yields from fertilised fields and from unfertilised fields will allow us to consider things such as:

  • whether the population means of the two groups seem to be equal.

  • if the population standard deviations look approximately equal between the groups.

The code below creates two histograms, one above the other, of wheat yields from fertilised and unfertilised fields. The bin widths used, set by the breaks = argument, and the range of the y-axis, set by the ylim = argument, are kept consistent across the two histograms so they are comparable.

par(mfrow = c(2, 1))

hist(x = wheat_fertilised$yield,
     breaks = seq(from = 12, to = 32, by = 1),
     col = "brown",
     ylim = c(0, 40),
     main = "Wheat Yield from Fertilised Fields",
     xlab = "Yield (tonnes)")

hist(x = wheat_unfertilised$yield,
     breaks = seq(from = 12, to = 32, by = 1),
     col = "brown1",
     ylim = c(0, 40),
     main = "Wheat Yield from Unfertilised Fields",
     xlab = "Yield (tonnes)")

par(mfrow = c(1, 1))
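
The wheat code differs from the potato code only in the data, breaks, y-axis range, colours and titles. As a sketch of how that repetition could be avoided, the shared steps can be wrapped in a function. The function plot_yield_histograms() and its arguments are hypothetical (they are not part of the lab materials), and the data used to try it out are simulated.

```r
# Hypothetical helper: two stacked histograms with shared breaks and y-axis
plot_yield_histograms <- function(fertilised, unfertilised,
                                  breaks, ylim, cols, crop_name) {
  old_par <- par(mfrow = c(2, 1))
  on.exit(par(old_par))  # restore the previous layout when finished

  hist(x = fertilised, breaks = breaks, col = cols[1], ylim = ylim,
       main = paste(crop_name, "Yield from Fertilised Fields"),
       xlab = "Yield (tonnes)")
  hist(x = unfertilised, breaks = breaks, col = cols[2], ylim = ylim,
       main = paste(crop_name, "Yield from Unfertilised Fields"),
       xlab = "Yield (tonnes)")
  invisible(NULL)
}

# Simulated stand-ins for the two groups of yields (not the real data)
set.seed(5)
toy_fert   <- rnorm(100, mean = 24, sd = 2)
toy_unfert <- rnorm(100, mean = 18, sd = 1.5)

# Breaks of width 1 that cover both simulated groups
toy_breaks <- seq(from = floor(min(toy_fert, toy_unfert)),
                  to   = ceiling(max(toy_fert, toy_unfert)),
                  by   = 1)

plot_yield_histograms(fertilised = toy_fert, unfertilised = toy_unfert,
                      breaks = toy_breaks, ylim = c(0, 40),
                      cols = c("brown", "brown1"), crop_name = "Wheat")
```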

The histograms can help give an initial impression of what an informal answer to the second aim might be.

Task

How do the sample means of wheat yields from fertilised and unfertilised fields compare to each other?

Task

How do the sample standard deviations of wheat yields from fertilised and unfertilised fields compare to each other?

Task

Add the histograms you have created in R to the Exploratory Analysis section of your statistical report. Make sure to caption your figure appropriately.

Write a paragraph that describes what the histograms show, the initial comparisons between the two groups of wheat yield and how this relates to the aims of the analysis.

To help us choose a suitable statistical method to formally answer whether there is a difference in the mean yields of wheat from fertilised and unfertilised fields, we want to assess whether each group of yields follows a normal distribution. This can be done using QQ plots again.

The code below creates two side-by-side QQ plots: one for the wheat yields from fertilised fields and one for the wheat yields from unfertilised fields.

par(mfrow = c(1, 2))

qqnorm(y = wheat_fertilised$yield,
       main = "Wheat Yield from Fertilised Fields")
qqline(y = wheat_fertilised$yield)

qqnorm(y = wheat_unfertilised$yield,
       main = "Wheat Yield from Unfertilised Fields")
qqline(y = wheat_unfertilised$yield)

par(mfrow = c(1, 1))

Task

How would you describe the normality of wheat yields from fertilised and unfertilised fields from the sample?

Task

Add the QQ plots you have created in R to the Exploratory Analysis section of your statistical report. Make sure to caption the figure appropriately.

Write a paragraph that describes what the QQ plots show, describing how closely the yields appear to follow a normal distribution.

Task

Save your statistical report so far. It should now contain a full Introduction and Exploratory Analysis section.

We will add to the same statistical report in Lab 9 by working on the Statistical Analysis and Conclusion.

So far, we have seen from the exploratory analysis that:

  • the sample mean yields of potatoes from fertilised and from unfertilised fields are approximately equal.

  • the sample standard deviations of potato yields from fertilised and from unfertilised fields are approximately equal.

  • the yields of potatoes from fertilised and unfertilised fields both approximately follow a normal distribution.

  • the sample mean yield of wheat from fertilised fields is greater than the sample mean yield of wheat from unfertilised fields.

  • the sample standard deviation of wheat yields from fertilised fields is greater than the sample standard deviation of wheat yields from unfertilised fields.

  • the yields of wheat from fertilised and unfertilised fields both approximately follow a normal distribution.